Advanced topic 5, 2012
Topic 5 from Advanced Topics 2012.

SYMIAN: Analysis and Performance Improvement of the IT Incident Management Process

Background: Know: Recognize: IT Incident Management Process.

This page presents SYMIAN, a decision support tool for the performance improvement of the incident management function in IT support organizations.

I. INTRODUCTION

According to the IT Infrastructure Library (ITIL), Incident Management is the process for restoring normal service operation after a disruption, as quickly as possible and with minimum impact on the business. However, the complexity of real-life enterprise-class IT support organizations makes it extremely difficult to understand how organizational, structural and behavioural components affect the performance of the currently adopted incident management strategy and, consequently, which actions could improve it.

SYMIAN (SYMulation for Incident ANalysis) is a decision support tool for the performance analysis and optimization of the incident management function in IT support organizations. It enables its users to build an accurate model of a real-life IT support organization by using a discrete simulator, to evaluate the organization's performance in managing incidents, and to assess the likely improvements brought by organizational, structural and behavioural changes. The experimental results illustrate the effectiveness of the SYMIAN-based performance analysis and tuning process.

II. ANALYSIS OF THE INCIDENT MANAGEMENT IN IT SUPPORT ORGANIZATIONS

Figure 1 on the right shows a conceptual model of the IT support organization for incident management. As it shows, IT support organizations typically consist of a network of support groups, each comprising a set of operators with their work schedules. Support groups are divided into support levels (usually three to five), with lower-level groups dealing with generic issues and higher-level groups handling technical and time-consuming tasks.
In particular, the Help Desk represents the interface for customers reporting an IT service disruption. In response to a customer request, the Help Desk opens an incident, sometimes also called a trouble ticket or simply a ticket. The incident is then assigned to a specific support group.

Figure 2 on the right shows the process of incident management. An incident passes through different states and is tackled by different support groups throughout its lifetime. At each of these steps, the incident record is updated with pertinent information. If, for some reason, customers request the organization to stop working on the incident, the incident is placed in a suspended state to avoid incurring SLO (Service Level Objective) penalties. Once the disruption is repaired, the ticket is placed in the closed state until the end user confirms that the service has been fully restored. At that point, the incident is resolved and its lifecycle ends (see Fig. 3 on the left).

III. PERFORMANCE ANALYSIS AND BOTTLENECK LOCATION

The adoption of a fine-grained model for IT support organizations allows for the definition of performance metrics that can accurately capture the business impact of service disruptions. More specifically, performance metrics should consider two orthogonal dimensions: the effectiveness of incident routing and the efficiency of each support group in dealing with the incidents.
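To make the two dimensions concrete, the following minimal Python sketch computes two routing metrics (reassignments per incident, incidents seen twice or more at a group) and one efficiency metric (mean sojourn time per support group). It is only an illustration, not SYMIAN's implementation: the `VISITS` log format, the group names and the function names are hypothetical, and counting a reassignment as "one visit after the first" is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical visit log (not SYMIAN's data format): one record per visit,
# (incident_id, support_group, arrival_time, departure_time), times in hours.
VISITS = [
    ("INC1", "HelpDesk", 0.0, 0.5),
    ("INC1", "SG2", 0.5, 3.5),
    ("INC1", "HelpDesk", 3.5, 4.0),
    ("INC2", "HelpDesk", 1.0, 2.0),
]


def reassignments_per_incident(visits):
    """Routing metric: hand-offs per incident, assumed to be visits minus one."""
    visits_per_incident = Counter(incident for incident, _g, _a, _d in visits)
    return {incident: n - 1 for incident, n in visits_per_incident.items()}


def incidents_seen_twice_or_more(visits):
    """Routing metric: per group, how many incidents visited it at least twice."""
    visits_per_pair = Counter((incident, group) for incident, group, _a, _d in visits)
    repeated = Counter(group for (_i, group), n in visits_per_pair.items() if n >= 2)
    return dict(repeated)


def mean_sojourn_time(visits):
    """Efficiency metric: mean time (waiting plus service) per visit at each group."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _incident, group, arrival, departure in visits:
        total[group] += departure - arrival
        count[group] += 1
    return {group: total[group] / count[group] for group in total}


if __name__ == "__main__":
    print(reassignments_per_incident(VISITS))
    print(incidents_seen_twice_or_more(VISITS))
    print(mean_sojourn_time(VISITS))
```

The routing metrics look at the path of each incident through the organization, while the sojourn time looks at each support group in isolation, which is why the two dimensions can be treated separately.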
Among the predefined performance metrics meant to determine the effectiveness of routing in IT support organizations, we consider:

· number of reassignments per incident;
· number of assignment cycles;
· number of incidents seen twice or more at a given support group;
· number of cross-level reassignments;
· number of incident record updates (operator transactions) between (forward / back) reassignments;
· number of inconclusive updates (operator transactions) at a single support group before the incident is bounced back to the originating support group;
· time to closure after reassignments;
· number of incidents that had an unusually large service time at a given support group and were then escalated to another support group.

The set of predefined performance metrics aimed at measuring the efficiency of support groups in dealing with incidents is:

· fan-in and fan-out of support groups;
· mean incident sojourn time in support group;
· number of incidents received vs. number of incidents resolved;
· number of incidents treated;
· number of operators that worked on the same ticket at each support group.

Potential bottlenecks are located at support groups where tickets spend most of their time, support groups with a low incident resolution/escalation ratio, support groups processing a high number of incoming incidents, and finally support groups with many operators working on tickets.

Several operations are available to IT managers for the optimization of IT support organizations:

· increasing or cutting staffing levels;
· transferring operators between support groups;
· implementing different prioritization policies for incident queues;
· implementing different operator selection policies.

IV. SYMIAN: THE SYMIAN DECISION SUPPORT TOOL

A.
Parameter identification

SYMIAN allows its users to build an accurate model of an IT support organization, which includes the number of support groups, the support levels, the set of operators, the operator work shifts, etc. SYMIAN models the IT support organization as a queuing network. Each support group $g_i$, with $i = 1, \dots, N$, is modeled as a $G/G/s_i$ queue, that is, a multiserver queue with random arrival and service times.

SYMIAN uses a stochastic transition matrix (see the figure on the right). Each element $t_{ij}$ of the transition matrix $T$ represents the probability that a ticket will be transferred from support group $g_i$ to support group $g_j$. To account for the interactions of the IT support organization with the outside world (arrival and departure of incidents), an extra virtual state 0 is introduced in the transition matrix. This makes it possible to define the incident arrival vector $(a_i) = (t_{0i})$ for $i > 0$ and the incident closure vector $(c_i) = (t_{i0})$ for $i > 0$.

The model of each support group also includes other elements, such as a queue of incoming tickets, a service time distribution, an operator selection policy, and an incident prioritization policy, which are affected by the category and priority of incidents. In most cases, incident arrivals can be modeled as a Poisson or non-homogeneous Poisson process.

B. The SYMIAN Simulation Process

1. Preparation

The modeled IT support organization generates incidents according to the incident inter-arrival time distribution and transfers each of them to the relevant support group $g_i$. The service time distribution for group $g_i$ can then be computed. The simulator tries to serve the incident at the top of the queue.

2. Working condition

If the operator is available, the incident is served. If the operator is busy, the incident is put back in the queue, leaving the queue unchanged. If the operator's work shift ends before the incident service time has expired, the incident is put back in the incoming queue and the operator goes off duty.
When the incident service is finished, the operator becomes available again.

3. Finish

When the service of an incident at support group $g_i$ is finished, the incident is closed with probability $c_i$, in which case the necessary information is collected. Otherwise, with probability $t_{ij}$, it is transferred to another support group $g_j$, where it enters a new queue to be served.

V. USING SYMIAN FOR OPTIMIZING PERFORMANCE

SYMIAN gives several options to optimize the performance of IT support organizations. Some of the operations available to IT managers, such as support group removal, support group creation, merging of two support groups, and splitting of a support group, have a major impact on the IT support organization and require the redefinition of the transition matrix. Other less invasive operations, such as support group re-staffing, work shift redefinition, and modification of the incident prioritization and/or operator assignment policy, are also available. The optimization options supported by SYMIAN are described in the following subsections.

1) Removing support groups

When removing a support group $g$, the arrival and closure vectors and the transition matrix are updated to reflect the support group deletion. Supposing, without loss of generality, that $g$ is the $N$-th group, the new closure vector is given by $(c'_i) = (c_i)/(1 - c_N)$ (excluding the trivial case where $c_N = 1$). A similar transformation is applied to the incident arrival vector. The new transition matrix is then $(t'_{ij}) = (t_{ij})/(1 - t_{iN})$ (again excluding trivial cases).

2) Creating support groups

When creating a new support group (without loss of generality indexed $N + 1$), the user is required to provide the scalars $a_{N+1}$ and $c_{N+1}$, representing the incident arrival and closure probability at the group. The incident arrival and closure vectors are extended with $a_{N+1}$ and $c_{N+1}$ and renormalized as above. The rows of the transition matrix are first updated according to: $(t'_{ij}) = (t_{ij}) \cdot (1 - \mathit{from}_i) \quad \forall i, j \in 1 \dots N$.
At this point the matrix is extended with the row $(t'_{N+1,j}) = (\mathit{to}_j)$ and the column $(t'_{i,N+1}) = (\mathit{from}_i)$.

3) Merging support groups

When merging two support groups $g_1$ and $g_2$, SYMIAN requires information on the volume of incidents processed at each group. The merging operation is equivalent to the removal of each group, followed by the creation of a new group whose arrival, closure and transition probabilities are calculated as follows. If $r$ is the ratio between the volumes of incidents processed at $g_1$ and $g_2$, the arrival and closure probabilities of the new group will be $r \cdot a_{g_1} + (1 - r) a_{g_2}$ and $r \cdot c_{g_1} + (1 - r) c_{g_2}$. The $(\mathit{to}_j)$ and $(\mathit{from}_i)$ vectors for the new group will be respectively $(\mathit{to}_j) = r \cdot (t_{g_1,j}) + (1 - r)(t_{g_2,j})$ and $(\mathit{from}_i) = r \cdot (t_{i,g_1}) + (1 - r)(t_{i,g_2})$.

4) Splitting support groups

When splitting an existing support group $g$ into two smaller support groups, SYMIAN requires the user to state the ratio $r$ of the incident volume that each new group is expected to have. SYMIAN suggests setting this ratio at 1/2 by default. The splitting operation is equivalent to the removal of the old group, followed by the addition of two new groups that have:

· the arrival probabilities $a_g \cdot r/(1 + r)$ and $a_g \cdot 1/(1 + r)$;
· the same closure probability as the original group, $c_g$;
· $(\mathit{to}_j)$ vectors that are identical to the original group's transition probabilities $(t_{gj})$;
· $(\mathit{from}_i)$ vectors that are given by $r \cdot (t_{ig})$ and $(1 - r)(t_{ig})$, respectively.

5) Changing staffing levels, work shifts, and incident management policies

SYMIAN allows the user to change the staffing level of the support groups in the IT support organization. To this end, the tool requires the user to state, for each support group to consider, either the new absolute value of the staffing level or a multiplying constant that defines the new staffing level relative to the previous one.
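The group removal and creation updates from subsections 1) and 2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SYMIAN's code: the matrix is stored in its extended form with the virtual state 0 (so row 0 holds the arrival vector and column 0 the closure vector), every row is renormalized in the same way, and the function and parameter names (`remove_group`, `add_group`, `to`, `frm`) are hypothetical.

```python
def remove_group(T, g):
    """Drop state g from the row-stochastic matrix T and renormalize each row.

    Assumption: each remaining row i is divided by (1 - T[i][g]), so that it
    sums to 1 again. A row that routed everything to g is the trivial case.
    """
    if g == 0:
        raise ValueError("state 0 is the virtual outside world and cannot be removed")
    keep = [k for k in range(len(T)) if k != g]
    new_T = []
    for i in keep:
        rest = 1.0 - T[i][g]
        if rest <= 0.0:
            raise ValueError(f"state {i} routes only to group {g} (trivial case)")
        new_T.append([T[i][j] / rest for j in keep])
    return new_T


def add_group(T, to, frm):
    """Append a new group as the last state of T.

    to[j]  : probability of moving from the new group to existing state j
             (to[0] is its closure probability); must sum to 1.
    frm[i] : probability of moving from existing state i to the new group
             (frm[0] is its arrival probability).
    Existing rows are scaled by (1 - frm[i]) before the new column is added.
    """
    n = len(T)
    new_T = [[T[i][j] * (1.0 - frm[i]) for j in range(n)] + [frm[i]] for i in range(n)]
    new_T.append(list(to) + [0.0])
    return new_T


if __name__ == "__main__":
    # Two support groups plus the virtual state 0; values are made up.
    T = [
        [0.0, 0.5, 0.5],  # arrivals: half to group 1, half to group 2
        [0.6, 0.0, 0.4],  # group 1: closes 60%, escalates 40% to group 2
        [1.0, 0.0, 0.0],  # group 2: closes everything
    ]
    print(remove_group(T, 2))
    print(add_group(T, to=[0.5, 0.25, 0.25], frm=[0.2, 0.1, 0.0]))
```

With this representation, adding a group and then removing it again returns the original matrix, which is a quick sanity check that the scaling by $(1 - \mathit{from}_i)$ and the division by $(1 - t_{iN})$ are inverse operations.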
In addition, IT managers can also change operator work shifts, either at the support group level or at the granularity of a single operator. The available options include both 24x7 work shifts (where operators are always on duty) and 8-hour-per-day work shifts, which model service times more realistically and also take into account the operator's time zone of residence. Finally, SYMIAN allows IT managers to change the incident management policies at each support group.

VI. SYMIAN: ARCHITECTURE AND IMPLEMENTATION

Fig. 4 shows the main components of the tool: the Configuration Interface (CI), the User Interface (UI), the Configuration Manager (CM), the Parameter Identification Module (PIM), the Simulator Core (SC), the Data Collector (DC), the Trace Analyzer (TA), the Statistics Module (SM), and the Reporting Module (RM).

More specifically, the Configuration Interface component allows users to configure the IT support organization to simulate. The User Interface component allows users to load simulation parameters from a file, to change the current simulation parameters, to save the current simulation parameters to a file, and to start simulations. The Configuration Manager takes care of the simulator configuration, enforcing the user-specified behaviors, e.g., with regard to the verbosity of tracing information, and the simulator parameters. The Parameter Identification Module provides statistical inference functions that can determine whether the samples in a given data set are distributed according to a known random variable distribution. The Simulator Core component implements the domain-specific model. SC has three sub-components: the Incident Generator (IG), the Incident Response Coordinator (IRC) and the Incident Processor (IP). The Data Collector component collects data from the simulation that can be post-processed to assess the performance of incident management in the modeled organization.
Finally, the Statistics Module and the Reporting Module respectively provide basic statistics and reporting functions for the higher-layer components.

VII. EXPERIMENTAL RESULTS

This section presents an experimental evaluation of the effectiveness of SYMIAN in the performance analysis and optimization of a real-life IT support organization. For this experiment, we used data provided to us by the Outsourcing Services Division of HP. HP Outsourcing manages, among other IT services, the Help Desk function on behalf of various enterprise customers. The data used for this experiment comes from the subset of the organization serving a single enterprise customer from the financial services industry.

A. Model Inference and Validation

We obtained database logs of incidents for a 6-month period, consisting of data for more than 23,000 incidents. For each incident, the data carried transactional information about the arrival and departure times at each visited support group. Using SYMIAN's statistical analysis and inference functions on the transactional data, we were able to construct a reasonably accurate model of the BailUsOut IT support organization. First, we constructed the escalation matrix and derived the stochastic transition matrix by normalization. Then, we modeled each support group as a G/M/s first-come-first-served (FCFS) queue. The transactional data did not contain information about actual service times, but only about the aggregate waiting plus service times. Hence, for each support group we had to estimate the mean service time parameter. Finally, to model incident inter-arrival times we used an exponential probability distribution with a rate parameter estimated from the transactional data.

After comparing the outcome of the simulation with the transactional data, we verified that SYMIAN could reproduce the behavior of the BailUsOut IT support organization with reasonably good fidelity. Fig.
5 shows the comparison of the (empirical) cumulative distribution functions of total incident service times; Fig. 6 that of the number of support groups visited per incident; and Fig. 7 that of the number of incidents received at each support group, for transactional data and simulation outcome.

In order to verify the accuracy of the model, we performed a statistical null hypothesis analysis of the results. We then analyzed the densities of historical and simulated service times through kernel density estimation, and verified that the match is very good in the distribution tails (see Fig. 8), although not as good for low service times. This is also confirmed by the mismatch for low service times in Fig. 5. Similar results have been obtained for received incidents and hops.

By applying kernel density estimation to analyze sojourn times at the various support groups, we discovered that support groups in the BailUsOut IT support organization have different (usually three) service priorities. However, inferring the number of operators in each support group and their allocation to incidents in the different queues would be very challenging. In addition, modeling the reallocation of workforce at the single support group level would require considering a much larger number of parameters, significantly complicating the performance tuning task for SYMIAN users.

B. Evaluation of Configuration Changes

This part presents an experimental evaluation of the effectiveness of SYMIAN in the performance analysis and improvement of a real IT support organization, building on the model construction and validation process described above. SYMIAN is used to minimize the service disruption time in the BailUsOut organization, with the constraint of preserving the current number of operators.
The objectives of the performance improvement process are therefore the maximization of the mean incidents closed daily (MICD) metric and the minimization of the mean time to resolution (MTTR) metric.

In order to locate the performance bottlenecks, a bottleneck score is calculated for each support group: $BS_i = (FI_i + FO_i) \cdot RI_i \cdot WT_i$, where $FI_i$ and $FO_i$ are the fan-in and fan-out, $RI_i$ is the number of received incidents, and $WT_i$ is the waiting time at support group $SG_i$. The figure below plots the scores obtained with this formula. The scores show that support groups SG22 and SG39 are the major performance bottlenecks of the organization. Moreover, the uneven distribution of the scores suggests that optimizing SG22 and SG39 might improve the performance of the BailUsOut IT support organization.

To improve the organization's performance, we applied configuration changes that increase operator efficiency, thereby emulating an improvement in operator performance. We then ran 40 simulation runs to assess the impact of each change. The table below compares the performance metrics measured in the simulations of configuration changes to the BailUsOut IT support organization. The results provided for the MTTR and MICD metrics are mean values and 95% confidence intervals calculated over 40 simulation runs. In this table, the MTTR improvement is 17.18% and the MICD improvement is 2.05%.

To consider the effect of the support group merging and splitting options, we may merge SG22 and SG50. However, this might be infeasible in practice, because SG50 is a support group shared with other IT support organizations while SG22 is a dedicated support group. Furthermore, SG22 already has the highest workload. On the other hand, the support groups with the highest workloads, such as SG22 and SG39, are the best candidates for splitting. So, in theory and according to some simulation data, the merging option seems to be feasible.
However, in practice, a high-workload support group cannot be merged with other groups. As for splitting, the paper provides only limited data to support this option. In addition, none of these performance improvement processes takes into account business aspects, which might have some impact on them.

In conclusion, small changes to the configuration of support groups with a large fan-in and fan-out can have a relatively large impact on the behavior of the whole system. It is expected that local optimization might not be a very effective practice in IT support organizations with strongly interconnected support groups and an even distribution of workload. In this case, it is necessary to adopt different optimization strategies and to consider the relationships between the different support groups.

Additionally, the material in this part can be related to FCAPS, the five functional areas for managing OSI systems, especially configuration management (setting the parameters that govern behavior) and performance management (measuring and recording system behavior).

VIII. RELATED WORK

For a detailed review of BDIM, see [5]. Some early works in BDIM include applications to change management [6], [7], capacity management and SLA design [8], [9], [10], network security [11], and network configuration management [12]. This work is also related to approaches to business operation analysis that aim at improving business processes through the collection of metrics and inferences over them, for example [13], and to simulation methods [14]. Diao et al. recently studied the estimation of labor cost and business value of IT services from the analysis of process complexity [15], [16]. The article [3] has extensively studied the business impact of incident management strategies, using a methodology that moved from the definition of business-level objectives such as those commonly used in balanced scorecards [17].
The analysis of the incident management process and the IT support organization model in this paper are based on [4]. Works on modeling IT support organizations can be found in [21], [22]. WISE [23] proposes what-if analysis, an interesting approach to estimating the outcome of complex network management operations before putting them into practice. Queuing network-based models have been applied in a broad spectrum of research areas such as computing, communications, transportation systems, health care, manufacturing systems, and supply chain systems, as can be found in [24], [25], [26], [27], [28] and [29] respectively.

IX. CONCLUSIONS AND FUTURE WORK

This work shows that optimizing the performance of large-scale IT support organizations is a very complex task. This paper presented the SYMIAN tool for the performance optimization of incident management in IT support organizations. The application of SYMIAN to a real-life IT support organization demonstrates the tool's effectiveness in the performance analysis and improvement process. In the SYMIAN evaluation process, open queuing network models could reproduce the behavior of real-life IT support organizations with a very high degree of accuracy. The results call for further study, which could lead to a deeper understanding of the performance of the incident management function in IT support organizations. Moreover, the SYMIAN decision support tool may be very useful in commercial applications.

Reference

[1] C. Bartolini, C. Stefanelli, and M. Tortonesi, "SYMIAN: Analysis and Performance Improvement of IT Incident Management Process," IEEE Transactions on Network and Service Management, vol. 7, no. 3, Sept. 2010.
[3] C. Bartolini, M. Sallé, and D. Trastour, "IT service management driven by business objectives: an application to incident management," in Proc. IEEE/IFIP Network Operations and Management Symposium (NOMS 2006), Apr. 2006.
[4] G. Barash, C. Bartolini, and L.
Wu, "Measuring and improving the performance of an IT support organization in managing service incidents," in Proc. 2nd IEEE Workshop on Business-driven IT Management (BDIM 2007), Munich, Germany, 2007.
[5] A. Moura, J. Sauvé, and C. Bartolini, "Research challenges of business-driven IT management," in Proc. 2nd IEEE/IFIP International Workshop on Business-Driven IT Management (BDIM 2007), Munich, Germany.
[6] A. Keller, J. Hellerstein, J. L. Wolf, K. Wu, and V. Krishnan, "The CHAMPS system: change management with planning and scheduling," in Proc. IEEE/IFIP Network Operations and Management Symposium (NOMS 2004), Apr. 2004.
[7] J. Sauvé, R. Rebouças, A. Moura, C. Bartolini, A. Boulmakoul, and D. Trastour, "Business-driven decision support for change management: planning and scheduling of changes," in Proc. DSOM 2006, Dublin, Ireland.
[8] S. Aiber, D. Gilat, A. Landau, N. Razinkov, A. Sela, and S. Wasserkrug, "Autonomic self-optimization according to business objectives," in Proc. International Conference on Autonomic Computing, 2004.
[9] D. Menascé, V. A. F. Almeida, R. Fonseca, and M. A. Mendes, "Business-oriented resource management policies for e-commerce servers," Performance Evaluation 42. Elsevier Science, 2000, pp. 223-239.
[10] J. Sauvé, F. Marques, A. Moura, M. Sampaio, J. Jornada, and E. Radziuk, "SLA design from a business perspective," in Proc. DSOM 2005.
[11] H. Wei, D. Frinke, O. Carter, et al., "Cost-benefit analysis for network intrusion detection systems," in Proc. 28th Annual Computer Security Conference, Oct. 2001.
[12] R. Boutaba, J. Xiao, and I. Aib, "CyberPlanner: a comprehensive toolkit for network service providers," in Proc. 11th IEEE/IFIP Network Operation and Management Symposium (NOMS 2008), Salvador de Bahia, Brazil.
[13] F. Casati, M. Castellanos, U. Dayal, and M. C. Shan, "A metric definition, computation, and reporting model for business operation analysis," in Proc.
Advances in Database Technology - EDBT 2006, 10th International Conference on Extending Database Technology, Munich, Germany, Mar. 2006.
[14] K. Tumay, "Business process simulation," in Proc. Simulation Conference 1995, Winter Volume, Dec. 1995, pp. 55-60.
[15] Y. Diao, A. Keller, S. Parekh, and V. Marinov, "Predicting labor cost through IT management complexity metrics," in Proc. 10th IEEE/IFIP Symposium on Integrated Management (IM 2007), Munich, Germany.
[16] Y. Diao and K. Bhattacharya, "Estimating business value of IT services through process complexity analysis," in Proc. 11th IEEE/IFIP Network Operation and Management Symposium (NOMS 2008), Salvador de Bahia, Brazil.
[17] R. Kaplan and D. Norton, "The balanced scorecard: measures that drive performance," Harvard Business Review, vol. 70, no. 1, pp. 71-79, 1992.
[21] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis, "Efficient ticket routing by resolution sequence mining," in Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), Las Vegas, NV, USA, Aug. 2008.
[22] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis, "EasyTicket: a ticket routing recommendation engine for enterprise problem resolution," in Proc. 34th International Conference on Very Large Data Bases (VLDB'08), Auckland, New Zealand, Aug. 2008.
[23] M. Bin Tariq, A. Zeitoun, V. Valancius, N. Feamster, and M. Ammar, "Answering what-if deployment and configuration questions with WISE."
[24] D. Liu, Y.-D. Cao, and C.-Q. Li, "Liana: a decentralized load-dependent scheduler for performance-cost optimization of grid service," J. Supercomputing, vol. 49, no. 1, pp. 127-156, July 2009.
[25] N. Bisnik and A. Abouzeid, "Queuing network models for delay analysis of multihop wireless ad hoc networks," Ad Hoc Networks, vol. 7, no. 1, pp. 79-97, Jan. 2009.
[26] D. Mitchell and J. MacGregor Smith, "Topological network design of pedestrian networks," Transportation Research Part B: Methodological, vol. 35, no. 2, Feb. 2001.
[27] N. Koizumi, E.
Kuno, and T. Smith, "Modeling patient flows using a queuing network with blocking," Health Care Management Science, vol. 8, no. 1, Feb. 2005.
[28] V. Bhaskar and P. Lallement, "A four-input three-stage queuing network approach to model an industrial system," Applied Mathematical Modelling, vol. 33, no. 8, Aug. 2009.
[29] Q. Gong, K. K. Lai, and S. Wang, "Supply chain networks: closed Jackson network models and properties," International Journal of Production Economics, vol. 113, no. 2, June 2008.